Statistical inference using weights and survey design

3. R and Stata examples

Pierre Walthéry

UK Data Service

October 2025

Plan

- Three sessions:

  1. Survey design: a refresher

  2. Inference in theory and practice

  3. R and Stata examples

This session

  1. R vs Stata speak
  2. Means, proportions, CI
  3. When survey design variables are not present

Data and software requirements

1. R vs Stata speak

Survey design estimation in R

  • The R survey package (Lumley 2024) provides a comprehensive set of functions for computing point and variance estimates from survey data.

  • Overall logic:

    • Install the survey package and load it into memory
    • Declare the survey design: i.e. create a svydesign object from the data
         mydata.s<-svydesign(id=~psu_var,
                             strata=~strata_var,
                             weights=~weight_var,
                             data=mydata)
  • Compute estimates with survey-specific functions:

    • svymean(~myvar, mydata.s)
    • svytable(~myvar, mydata.s)

Command-based weighting in R

  • R does not provide a unified set of functions for computing command-based weighted estimates.

  • Implementation of weighting may vary between packages, but algorithms are usually described in detail in the package documentation.

  • Base R has only one weight-aware descriptive function: weighted.mean()

  • Some R packages offer a more comprehensive set of weighted estimation functions:

    • Hmisc: wtd.mean(), wtd.var(), wtd.quantile()
    • summarytools: descr(), freq()
  • Confidence intervals and standard errors still have to be computed manually
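As a minimal sketch of the above with made-up data — weighted.mean() gives the point estimate, while the standard error and confidence interval have to be assembled by hand (Hmisc::wtd.var() could replace the hand-computed variance):

```r
### Made-up toy data: outcome values and their survey weights
x <- c(20, 30, 40)
w <- c(1, 1, 2)

### Base R weighted point estimate: (1*20 + 1*30 + 2*40) / 4
m <- weighted.mean(x, w)                  ### 32.5

### The standard error and 95% CI must be computed manually,
### here with a weighted variance and a normal approximation
v  <- sum(w * (x - m)^2) / (sum(w) - 1)   ### Weighted variance
se <- sqrt(v / length(x))                 ### Naive standard error
c(m - 1.96 * se, m + 1.96 * se)           ### Approximate 95% CI
```

Note that this naive standard error ignores the survey design; the survey package functions shown later account for clustering and stratification as well.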

Survey design in Stata

  • Stata provides comprehensive support for computing survey design-informed estimates from survey data

  • Implementation logic similar to R:

    • Declare the survey design using svyset

      • svyset psu_var [pweight=weight_var], strata(strata_var)
    • Subsequently use svy:-prefixed commands for estimation:

      • svy: mean myvar, svy: tab myvar, etc.

Stata command-based weighting - 1

  • Users may add survey weights to most Stata estimation commands, or use survey-specific commands. The latter is recommended.

  • Stata distinguishes between four kinds of weights:

    • frequency weights (fweight),
    • analytical weights (aweight),
    • importance weights (iweight) and
    • probability weights (pweight).
  • These mostly differ in the way the variance and standard errors are computed

Stata command-based weighting - 2

  • Survey weights should be treated as probability weights (pweight or pw).

  • Command-based weighting in Stata (i.e. not using the survey design commands) consists of specifying the weight variable between square brackets.

    • stata_command myvar [pw=weight_var]
  • Common estimation commands, such as summarize or tabulate, do not allow using pw

    • \(\longrightarrow\) This is to nudge users to rely on the svy: commands instead
  • Some users instead specify the wrong kind of weight (fw or aw) in order to avoid using the survey design commands. This may yield correct point estimates, but the variance/standard errors are likely to be incorrect.

    • Do this at your own risk!

What about longitudinal weights?

  • Longitudinal data can either be analysed:

    • Cross-sectionally (i.e. a single wave of data) \(\longrightarrow\) cross-sectional weights

    • Longitudinally (i.e. two or more waves of data) \(\longrightarrow\) check your data documentation for the suitable weights:

      • Longitudinal weights may have been computed to account for attrition between two waves only…
      • … or across several of them
      • The general rule is to use the longitudinal weights from the last wave of data being used
  • From a software point of view, the above survey estimation functions apply to longitudinal data (with the correct weights!)
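A sketch of the idea in R, with entirely hypothetical data and variable names (wt_w2 standing in for whatever longitudinal weight your study documentation recommends):

```r
library(survey)

### Hypothetical two-wave extract: wt_w2 is the longitudinal weight
### computed for respondents who took part in both waves
mydata <- data.frame(psu_var    = c(1, 1, 2, 2, 3, 3, 4, 4),
                     strata_var = c(1, 1, 1, 1, 2, 2, 2, 2),
                     y_w2       = c(11, 13, 15, 18, 12, 17, 14, 16),
                     wt_w2      = c(1, 2, 1, 2, 1, 1, 2, 1))

### The svydesign() call is identical to the cross-sectional case;
### only the weights variable changes
mydata.l <- svydesign(id = ~psu_var, strata = ~strata_var,
                      weights = ~wt_w2, data = mydata)

svymean(~y_w2, mydata.l)   ### Design-informed estimate for the wave-2 outcome
```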

2. Means, proportions, confidence intervals

Identifying the survey design

We first need to find out about the survey design that was used in the 2017 BSA, and the design variables that are available in the dataset. Such information can usually be found in the documentation that comes with the data, in the mrdoc/pdf folder.

Question 1

  • What design was used in this survey?

    • How many stages were there, and what were the population units sampled?
    • What were the primary sampling units; the strata (if relevant)?

Answer to question 1

  • The 2017 BSA is a three-stage stratified random survey, with postcode sectors, addresses and individuals as the units selected at each stage.

  • Primary sampling units were furthermore stratified according to geographies (sub regions), population density, and proportion of owner-occupiers.

  • The sampling fraction (or proportion) was proportional to the size of postcode sectors (i.e. the number of addresses)

Finding the survey design variables

Now that we are a bit more familiar with the way the survey was designed, we can try to identify the design variables available in the dataset. The information can usually be found in the user manual or the data dictionary available under mrdoc/ukda_data_dictionaries.zip. The file may need to be decompressed separately.

Question 2

  • What survey design variables are available?
  • Are there any that are missing – if so which ones?
  • What is the name of the weights variable?

Q2 Answers

  • From the Data Dictionary it appears that:

    • the primary sampling units (sub regions) are identified by Spoint
    • the strata by StratID
    • the weights variable is WtFactor.
  • Addresses are not provided but could be approximated with a household identifier.

  • Searching directly from within R or Stata may also be an option

Preparing the data - R

rm(list=ls())
library(dplyr)                                                              ### Data manipulation functions
library(haven)                                                              ### Importing stata/SPSS files
library(Hmisc)                                                              ### Extra statistical functions
library(survey)                                                             ### Survey design functions
library(ggplot2)
setwd("~/OneDrive/trainings/Inference_workshops/Manchester_Nov24/")         ### Edit as appropriate
datadir<-"~/Data/"                                                          ### Edit as appropriate
bsa17<-read_dta(paste0(datadir,"bsa/UKDA-8450-stata/bsa2017_for_ukda.dta")) ### Importing Stata format dataset
dim(bsa17)                                                                  ### Check the dataset
[1] 3988  580
### You can also use `View()`
  • We can now specify the survey design using:

    • Spoint as Primary Sampling Unit ids,
    • StratID as strata ids,
    • WtFactor as weights.

Specifying the survey design in R (1)

  • We create a svydesign object, i.e. a survey design informed copy of the data, which will be used for subsequent estimation.
bsa17.s<-svydesign(ids=~Spoint, 
                   strata=~StratID, 
                   weights=~WtFactor,
                   data=bsa17)        ### Specifying the survey design
class(bsa17.s)                        ### Examining the svydesign object                                        
[1] "survey.design2" "survey.design" 

Specifying the survey design in R (2)

  • We can inspect the content of bsa17.s. Beware, the output is really large!
summary(bsa17.s)                      ### ... And looking at its content
Stratified 1 - level Cluster Sampling design (with replacement)
With (372) clusters.
svydesign(ids = ~Spoint, strata = ~StratID, weights = ~WtFactor, 
    data = bsa17)
Probabilities:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.2645  0.8288  1.0983  1.2386  1.6236  3.3318 
Stratum Sizes: 
           101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117
obs         18  22  30  18  16  21  22  37  10  22  19  35  23  19  19  21  25
design.PSU   2   2   3   2   2   2   2   3   2   3   2   3   2   2   2   2   2
actual.PSU   2   2   3   2   2   2   2   3   2   3   2   3   2   2   2   2   2
           118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134
obs         12  12  32  40  25  21  23  26  23  18  34  23  20  29  39  19  30
design.PSU   2   2   3   3   3   2   2   2   3   2   2   2   2   3   3   2   3
actual.PSU   2   2   3   3   3   2   2   2   3   2   2   2   2   3   3   2   3
           135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151
obs         20  10  21  12  26  16  20  17  21  24  30  30  18  29  24  19  28
design.PSU   2   2   2   2   3   2   2   2   2   3   2   3   2   3   2   3   2
actual.PSU   2   2   2   2   3   2   2   2   2   3   2   3   2   3   2   3   2
           152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168
obs         18   8  23  33  14  23  17  39  13  22  16  19  21  18  26  13  14
design.PSU   2   2   2   3   2   2   2   3   2   2   2   2   2   2   3   2   2
actual.PSU   2   2   2   3   2   2   2   3   2   2   2   2   2   2   3   2   2
           169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185
obs         22  20   8  22  31  22  24  19  38  20  29  24  29  21  23  32  36
design.PSU   2   2   2   2   2   2   2   2   3   2   2   2   3   2   2   3   3
actual.PSU   2   2   2   2   2   2   2   2   3   2   2   2   3   2   2   3   3
           186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202
obs         24  22  43  38  38  47  34  15  22  35  17  20  20  21  21  43  35
design.PSU   3   2   3   3   3   3   3   2   2   3   2   2   2   2   3   3   3
actual.PSU   3   2   3   3   3   3   3   2   2   3   2   2   2   2   3   3   3
           203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219
obs         28  25  19  18  28  15  21  30  24  33  24  22  30  24  44  18  26
design.PSU   3   3   2   2   2   2   2   2   2   3   2   2   3   2   3   2   2
actual.PSU   3   3   2   2   2   2   2   2   2   3   2   2   3   2   3   2   2
           220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236
obs         22  28  20  27  34  33  41  24  23  26  17  23  36  20  45  32  27
design.PSU   2   2   2   3   2   3   3   2   2   2   2   2   3   2   3   3   3
actual.PSU   2   2   2   3   2   3   3   2   2   2   2   2   3   2   3   3   3
           237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253
obs         33  25  39  31  29  33  20  43  22  24  26  29  37  22  27  25  43
design.PSU   3   3   3   3   2   2   2   3   2   2   2   2   3   2   2   2   3
actual.PSU   3   3   3   3   2   2   2   3   2   2   2   2   3   2   2   2   3
           254 255 256 257 258 259
obs          7  32  26  25  28  35
design.PSU   2   3   2   2   2   3
actual.PSU   2   3   2   2   2   3
Data variables:
  [1] "Sserial"              "Spoint"               "StratID"             
  [4] "WtFactor"             "OldWt"                "GOR_ID"              
  [7] "ABCVer"               "Country"              "househlde"           
 [10] "hhtypee"              "Rsex"                 "RAgeE"               
 [13] "RAgeCat"              "RAgeCat2"             "RAgecat3"            
 [16] "RAgecat4"             "RAgecat5"             "RSexAge"             
 [19] "RSexAge2"             "MarStat"              "Married"             
 [22] "legmarste"            "ChildHh"              "nch415e"             
 [25] "nch318e"              "hhch04e"              "hhch511e"            
 [28] "hhch1215e"            "hhch1617e"            "rch04e"              
 [31] "rch511e"              "rch1215e"             "rch1617e"            
 [34] "ownche"               "reconacte"            "RLastJob"            
 [37] "seconacte"            "Readpap"              "WhPaper"             
 [40] "paptype"              "TVNews"               "WebNews"             
 [43] "WNwSite1"             "WNwSite2"             "SMNews"              
 [46] "Internet"             "IntPers"              "MedResI"             
 [49] "SupParty"             "ClosePty"             "PartyIDN"            
 [52] "Partyid1"             "PartyId2"             "PartyID3"            
 [55] "PtyAlleg"             "Idstrng"              "Politics"            
 [58] "Coalitin"             "ConLabDf"             "VoteSyst"            
 [61] "ScotPar2"             "ECPolicy2"            "GovTrust"            
 [64] "Monarchy"             "MiEcono"              "MiCultur"            
 [67] "Spend1"               "Spend2"               "SocSpnd1"            
 [70] "SocSpnd2"             "SocSpnd3"             "SocSpnd4"            
 [73] "SocSpnd5"             "SocSpnd6"             "Dole"                
 [76] "TaxSpend"             "IncomGap"             "SRInc"               
 [79] "CMArran"              "RBGaran2"             "SepInvol"            
 [82] "SepServ"              "WkMent"               "WkPhys"              
 [85] "HProbRsp"             "PhsRetn"              "PhsRecov"            
 [88] "MntRetn"              "MntRecov"             "HCWork21"            
 [91] "HCWork22"             "HCWork23"             "HCWork24"            
 [94] "HCWork25"             "HCWork26"             "HCWork27"            
 [97] "HCWork28"             "HCWork29"             "NatFrEst"            
[100] "FalseBn2"             "RepFrau3"             "RepWho1"             
[103] "RepWho2"              "RepWho3"              "RepWho4"             
[106] "RepWho5"              "RepWho6"              "RepWho7"             
[109] "RepWho8"              "RepWho9"              "RepWho10"            
[112] "WhyNRep1"             "WhyNRep2"             "WhyNRep3"            
[115] "WhyNRep4"             "WhyNRep5"             "WhyNRep6"            
[118] "WhyNRep7"             "WhyNRep8"             "WhyNRep9"            
[121] "BFPnsh1"              "BFPnsh2"              "BFPnsh3"             
[124] "BFPnsh4"              "BFPnsh5"              "BFPnsh6"             
[127] "BFPnsh7"              "BFPnsh8"              "BFPnsh9"             
[130] "BFPnsh10"             "BFPnsh11"             "AwrPB"               
[133] "AdminPn2"             "LosofBen"             "AwrCRec"             
[136] "GovDoBF"              "ImpHDoc"              "ImpHPar"             
[139] "ImpHBeha"             "ImpHFam"              "ImpHEd"              
[142] "ImpHJob"              "ImpHNeig"             "ImpHArea"            
[145] "ImpHSafe"             "RespoHl2"             "HomsBult"            
[148] "YSBEmpl"              "YSBTrans"             "YSBGreen"            
[151] "YSBSch"               "YSBAfRnt"             "YSBAfOwn"            
[154] "YSBDesig"             "YSBShops"             "YSBMedic"            
[157] "YSBLibry"             "YSBLeis"              "YSBFinan"            
[160] "YSBOther"             "YSBDeps"              "YSBNone"             
[163] "HousGSD"              "Buldres"              "EdSpnd1c"            
[166] "EdSpnd2c"             "VocVAcad"             "ATTD151"             
[169] "ATTD152"              "ATTD153"              "ATTD154"             
[172] "ATTD155"              "ATTD156"              "ATTD157"             
[175] "ATTD158"              "ATTD81"               "ATTD82"              
[178] "ATTD83"               "ATTD84"               "ATTD85"              
[181] "ATTD86"               "ATTD87"               "ATTD88"              
[184] "GCSEFur"              "GCSEWrk"              "ALevFur"             
[187] "ALevWrk"              "HEdOpp"               "ChLikUn2"            
[190] "HEFee"                "FeesUni"              "FeesSub"             
[193] "Himp"                 "PREVFR"               "TRFPB6U"             
[196] "TRFPB9U"              "TrfPb10u"             "TrfConc1"            
[199] "DRIVE"                "carnume"              "CycDang"             
[202] "Bikeown2"             "BikeRid"              "TRAVEL1"             
[205] "TRAVEL2"              "TRAVEL3"              "TRAVEL4a"            
[208] "TRAVEL6"              "airtrvle"             "CCTrans1"            
[211] "CCTrans2"             "CCTrans3"             "CCTrans4"            
[214] "CCTrans5"             "CCTrans6"             "CCTrans7"            
[217] "CCTrans8"             "CCTrans9"             "CCALowE"             
[220] "CCACar"               "CCAPLANE"             "CCBELIEV"            
[223] "EUBrld"               "EUExInf2"             "EUExUne2"            
[226] "EUExIm2"              "EUExEco2"             "EUImpSov"            
[229] "LeavEUI"              "EUconte"              "EUcontu"             
[232] "EUconth"              "EULtop1"              "EULtop2"             
[235] "EULtop3"              "NHSSat"               "WhySat1"             
[238] "WhySat2"              "WhySat3"              "WhySat4"             
[241] "WhySat5"              "WhySat6"              "WhySat7"             
[244] "WhySat8"              "WhySat9"              "WhySat10"            
[247] "WhyDis1"              "WhyDis2"              "WhyDis3"             
[250] "WhyDis4"              "WhyDis5"              "WhyDis6"             
[253] "WhyDis7"              "WhyDis8"              "WhyDis9"             
[256] "WhyDis10"             "GPSat"                "DentSat"             
[259] "InpatSat"             "OutpaSat"             "AESat"               
[262] "CareSat3"             "NHSFProb"             "NHS5yrs"             
[265] "NHSNx5Yr"             "NHSAcc"               "NHSImp"              
[268] "AEtravel"             "CareNee2"             "PaySocia"            
[271] "CarePa2"              "SocFutur"             "Tranneed"            
[274] "Prejtran"             "PMS"                  "HomoSex"             
[277] "SSRel"                "RSuperv"              "rocsect2e"           
[280] "REmpWork"             "REmpWrk2"             "SNumEmp"             
[283] "WkJbTim"              "ESrJbTim"             "SSrJbTim"            
[286] "WkJbHrsI"             "ExPrtFul"             "EJbHrCaI"            
[289] "SJbHrCaI"             "RPartFul"             "S2PartFl"            
[292] "Remplyee"             "UnionSA"              "TUSAEver"            
[295] "NPWork10"             "RES2010"              "RES2000"             
[298] "SLastJb2"             "S2Employ"             "S2Superv"            
[301] "S2ES2010"             "S2ES2000"             "rjbtype"             
[304] "REconSum"             "REconPos"             "RNSEGGrp"            
[307] "RNSocCl"              "RNSSECG"              "RClass"              
[310] "RClassGp"             "RSIC07GpE"            "seconsum"            
[313] "S2NSEGGp"             "S2NSSECG"             "S2NSocCl"            
[316] "S2Class"              "S2ClassG"             "WAGMIN"              
[319] "RESPPAY"              "TRCURJM"              "TRCURJN"             
[322] "TRMRSJM"              "TRMRSJN"              "TRDIFJM"             
[325] "TRDIFJN"              "PHOURS"               "REGHOUR"             
[328] "WRKCON"               "JBMRESP"              "JBMWH1"              
[331] "JBMWH2"               "JBMWH3"               "JBMWH4"              
[334] "JBMWH5"               "JBMWH6"               "JBMWH7"              
[337] "JBMWH8"               "FLEXHRS"              "MgCWld"              
[340] "MgMWld"               "ChgAsJb1"             "ChgAsJb2"            
[343] "ChgAsJb3"             "ChgJbTim"             "RetExp"              
[346] "RetExpb"              "DVRetAge"             "PenKnow2"            
[349] "RPenSrc1"             "RPenSrc2"             "RPenSrc3"            
[352] "whrbrne"              "NatIdGB"              "NatId"               
[355] "tenure2e"             "RentPrf1"             "HAWhat"              
[358] "HAgdbd"               "HANotFM"              "LikeHA"              
[361] "HAYwhy"               "HANwhy"               "HsDepnd"             
[364] "ResPres"              "ReligSum"             "RlFamSum"            
[367] "ChAttend"             "bestnatu2"            "raceori4"            
[370] "DisNew2"              "DisAct"               "DisActDV"            
[373] "Knowdis1"             "Knowdis2"             "Knowdis3"            
[376] "Knowdis4"             "Knowdis5"             "Knowdis6"            
[379] "Knowdis7"             "DisPrj"               "Dis100"              
[382] "tea3"                 "HEdQual"              "HEdQual2"            
[385] "HEdQual3"             "EUIdent"              "BritID2"             
[388] "Voted"                "Vote"                 "EURefV2"             
[391] "EUVOTWHO"             "EURefb"               "AnyBN3"              
[394] "MainInc5"             "HHIncD"               "HHIncQ"              
[397] "REarnD"               "REarnQ"               "SelfComp"            
[400] "knwbdri"              "knwexec"              "knwclea"             
[403] "knwhair"              "knwhr"                "knwlaw"              
[406] "knwmech"              "knwnurs"              "knwpol"              
[409] "knwtchr"              "incdiffs"             "incdsml"             
[412] "govldif"              "socblaz"              "whoprvhc"            
[415] "whoprvca"             "actgrp"               "actpol"              
[418] "actchar"              "govnosa2"             "hhldjob"             
[421] "hhmsick"              "hdown"                "hadvice"             
[424] "hsococc"              "hlpmny"               "hlpjob"              
[427] "hlpadmin"             "hlplive"              "hlpill"              
[430] "lckcomp"              "isolate"              "leftout"             
[433] "peopadvt"             "peoptrst"             "trstcrts"            
[436] "trstprc"              "helpeldy"             "helpslf1"            
[439] "helpfrnd"             "fampress"             "reltdemd"            
[442] "ffrangr"              "eatout"               "newfrnd"             
[445] "pplcont"              "pplftf"               "parcont"             
[448] "sibcon2"              "chdcon2"              "othcont"             
[451] "frndcont"             "contint"              "ltsgnhth"            
[454] "depres"               "diffpile"             "acgoals"             
[457] "lifesat2"             "makeem"               "langgs"              
[460] "helpslf2"             "payback"              "domconv"             
[463] "sitwhr"               "hmecont"              "religcon"            
[466] "spseedu"              "ben3000"              "ben3000d"            
[469] "falcatch"             "uniaff"               "unicar"              
[472] "bothearn"             "sexrole"              "womworka"            
[475] "womworkb"             "parlvmf2"             "gendwrk"             
[478] "gendmath"             "gendcomp"             "sxbstrm"             
[481] "sxbintm"              "sxbstrw"              "sxbintw"             
[484] "sxblaw"               "sxbprov"              "sxboffb"             
[487] "sxbnoone"             "sxboth"               "sxbcc"               
[490] "carwalk2"             "carbus2"              "carbike2"            
[493] "shrtjrn"              "plnallow"             "plnterm"             
[496] "plnenvt"              "plnuppri"             "cartaxhi"            
[499] "carallow"             "carreduc"             "carnod2"             
[502] "carenvdc"             "resclose"             "res20mph"            
[505] "resbumps"             "ddnodrv"              "ddnklmt"             
[508] "specamsl"             "specammo"             "specamtm"            
[511] "speedlim"             "speavesc"             "mobdsafe"            
[514] "mobddang"             "mobdban"              "mobdlaw"             
[517] "eutrdmv"              "consvfa"              "labrfa"              
[520] "libdmfa"              "ukipfa"               "rthdswa2"            
[523] "rthdsaw2"             "rthdsca2"             "rthdssa2"            
[526] "rthdsprd"             "eqrdisab"             "nhsoutp2"            
[529] "nhsinp2"              "bodimr"               "bodimop"             
[532] "girlwapp"             "tprwrong2"            "eulunem"             
[535] "eulimm"               "eulecon"              "eulwork"             
[538] "eullowi"              "eulmlow"              "eulnhs"              
[541] "jbernmny"             "jbenjoy"              "topupchn"            
[544] "topupnch"             "topuplpa"             "worknow"             
[547] "losejob"              "jbgdcurr"             "robots"              
[550] "robown"               "voteduty"             "welfhelp"            
[553] "morewelf"             "unempjob"             "sochelp"             
[556] "dolefidl"             "welffeet"             "damlives"            
[559] "proudwlf"             "redistrb"             "BigBusnn"            
[562] "wealth"               "richlaw"              "indust4"             
[565] "tradvals"             "stifsent"             "deathapp"            
[568] "obey"                 "wronglaw"             "censor"              
[571] "leftrigh"             "libauth"              "welfare2"            
[574] "libauth2"             "leftrig2"             "welfgrp"             
[577] "eq_inc_deciles"       "eq_inc_quintiles"     "eq_bhcinc2_deciles"  
[580] "eq_bhcinc2_quintiles"

Mean age and its 95% CI

  • We can now produce a first set of estimates using the survey design information. We begin with the mean age of respondents.

  • We will need to use svymean()

svymean(~RAgeE,bsa17.s)
        mean     SE
RAgeE 48.313 0.4236

By default svymean() computes the standard error of the mean. We need to
embed it within confint() in order to get a confidence interval.

confint(                                  ### Computing the confidence interval...
  svymean(~RAgeE,bsa17.s)                 ### And the mean
  )                   
         2.5 %  97.5 %
RAgeE 47.48289 49.1433
  • For a slightly nicer output:
Code
round(                                    ### Rounding the results to one decimal point
  c(
    svymean(~RAgeE,bsa17.s),              ### Computing the mean...
    confint(svymean(~RAgeE,bsa17.s))      ### And its 95% CI
    ),
  1) 
RAgeE             
 48.3  47.5  49.1 

Question 3

  • What would be the consequences of:

    • weighting but not accounting for the sample design;
    • neither using weights nor accounting for the sample design?
  • When:

    • inferring the mean age in the population?
    • computing the uncertainty of this estimate?

Q. 3 answer (1): command-based weighting

  • We need to compute means and CI separately:
a.m<-wtd.mean(                               ### Weighted mean function from the
  bsa17$RAgeE,                               ### Hmisc package
  bsa17$WtFactor,
  normwt = T)                                ### Option specific to survey weights  

### Computation of the standard error by hand...
a.se<-sqrt(
          wtd.var(bsa17$RAgeE,               ### ... using the weighted variance function from Hmisc
                  bsa17$WtFactor,
                  normwt = T)
          )/
       sqrt(
         nrow(bsa17)                         ### ... shortcut to sample size
         )

c(a.m,                                       ### Concatenating the mean..
  a.m-1.96*a.se,                             ### ... the lower bound of the CI
  a.m+1.96*a.se)                             ### ... and the higher bound
[1] 48.31309 47.73719 48.88900

Q. 3 answer (2): unweighted estimates

ua.m<-mean(bsa17$RAgeE)                     ### mean() function from R Base
  
ua.se<-sd(bsa17$RAgeE)/                     ### ... standard error
      sqrt(nrow(bsa17))

c(ua.m,                                     ### and putting it all together
  ua.m-1.96*ua.se,
  ua.m+1.96*ua.se
  )
[1] 52.19358 51.63057 52.75659

Question 3 answer: Summary

  • Not using weights results in overestimating the mean age in the population (of those aged 18+) by about 4 years.
  • This might be due to the fact that older respondents are more likely to take part in surveys.
  • Using command-based weighting does not alter the value of the estimated population mean when compared with survey design-informed estimates…
  • … but would lead us to overestimate the precision (i.e. underestimate the uncertainty) of our estimate – by about plus or minus three months.

Computing a proportion and its 95% confidence interval

  • We can similarly estimate the distribution of a categorical variable in the population by estimating proportions (or percentages)
  • Let’s look at the proportion of people who declare that they are interested in politics.
  • This is the Politics variable in the BSA.
  • It has five categories ranging from 1: ‘A great deal’ to 5: ‘Not at all’.
  • We could recode categories 1 and 2 (i.e. ‘A great deal’ and ‘Quite a lot’) into a single one, but here we will directly select the relevant values on the fly, as R allows.

Let’s explore the variable

  • Phrasing of the question:
attr(bsa17$Politics,"label")     
[1] "How much interest do you have in politics?"
  • Sample distribution
table(
  droplevels(                                    ### We are using droplevels() 
             as_factor(bsa17$Politics, "both")   ### in order to hide categories
             )                                   ### ... without any observations
  ) 

[1] ... a great deal,      [2] quite a lot,             [3] some, 
                  739                   982                  1179 
   [4] not very much,  [5] or, none at all?        [8] Don`t know 
                  708                   379                     1 

  • Let’s infer the percentage of those who were interested in politics in 2017
  round(100*
    prop.table(
      svytable(~(Politics==1 | Politics==2),bsa17.s)
      ),1)
Politics == 1 | Politics == 2
FALSE  TRUE 
   57    43 
Alternative:
bsa17$PolC<-as.factor(ifelse(bsa17$Politics==1 | bsa17$Politics==2, "Interested", "Not so much") )
bsa17.s<-svydesign(ids=~Spoint, 
                   strata=~StratID, 
                   weights=~WtFactor,
                   data=bsa17)        

  round(100*
    prop.table(
      svytable(~(PolC),bsa17.s)
      ),1)
PolC
 Interested Not so much 
         43          57 
  • Let us now estimate the confidence intervals for these proportions. Software like Stata or SPSS usually doesn’t show us what is happening under the bonnet. R requires more coding, but also enables us to better understand what we are actually estimating.

  • Confidence intervals for proportions of categorical variables are in fact computed as a sequence of binomial/dichotomic estimations – i.e. one for each category.

  • In R we specify this via svyciprop() and I():

    • The former computes the proportion and its confidence interval (by default 95%)…
    • … the latter allows us to define the category of interest of a polytomous variable.
    • As before, we could have used a recoded dichotomic variable instead
p<-svyciprop(
            ~I(Politics==1 | Politics==2),
            bsa17.s)
p
                                        2.5% 97.5%
I(Politics == 1 | Politics == 2) 0.430 0.411 0.449
  • A neater version:
round(100*
        c("% interested"= p[[1]],                                          ### Extracts the point estimate
          "95% CI - LB"=attr(p,"ci")[[1]],"95% CI - UB"=attr(p,"ci")[[2]]  ### Extracts the CI
          ),1 
      )
% interested  95% CI - LB  95% CI - UB 
        43.0         41.1         44.9 

Question 4

  • What is the proportion of respondents aged 17-34 in the sample, as well as its 95% confidence interval?

    • You can use RAgecat5

Answer

  • The proportion of 17-34 year olds in the sample is:
a<-svyciprop(~I(RAgecat5 == 1),
              bsa17.s)
a
                        2.5% 97.5%
I(RAgecat5 == 1) 0.285 0.265 0.306
round(
  100*a[1],
     1)
I(RAgecat5 == 1) 
            28.5 

and its 95% confidence interval:

round(
    100*
    confint(a),  ### Another way of extracting CI from svyciprop objects
    1)
                 2.5% 97.5%
I(RAgecat5 == 1) 26.5  30.6

Computing domain estimates

  • Computing estimates for subpopulations adds a layer of complexity to what we have seen so far.

  • Weights are usually designed to make the full sample representative of the population

  • If we simply dropped cases outside the subpopulation and re-estimated:

    • this would amount to using only a fraction of the weights
    • and could yield incorrect standard errors, and therefore misleading precision, for our estimates.
  • It is instead recommended to use commands that take into account the entire distribution of the weights.

    • The R command that does this is svyby()
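
A single subpopulation can also be analysed by subsetting the design object itself: subset() applied to a svydesign object retains the full design information, so variance estimates remain design-correct (subsetting the raw data frame before declaring the design would not). A minimal sketch, assuming the bsa17.s design object from above; the region code used (GOR_ID == 8 for London) is an illustrative assumption:

```r
# Sketch: subset the *design object*, not the data frame.
# subset() on a svydesign object keeps the design for variance estimation.
# GOR_ID == 8 (London) is an illustrative assumption.
london.s <- subset(bsa17.s, GOR_ID == 8)
svymean(~RAgeE, london.s)   # design-correct mean age for the London domain
```

For estimates across all domains at once, svyby() as shown below remains the more convenient tool.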

Computing domain estimates in R

  • Say we would like to compute the mean age of BSA respondents by Government Office Regions

  • We need to specify:

    • The outcome variable whose estimate we want to compute: i.e. RAgeE
    • The grouping variable(s) GOR_ID
    • The estimation function we are going to use here: svymean()
    • And the type of variance estimation we would like to see displayed, i.e. standard errors or confidence intervals

Output

d<-      svyby(~RAgeE,               ### Outcome variable
             by=~as_factor(GOR_ID),  ### Subpopulations
             svymean,                ### Estimation function
             design=bsa17.s,
             vartype = "ci")         ### CI or SE 
round(d[-1],1)
                           RAgeE ci_l ci_u
A North East                46.1 43.6 48.6
B North West                49.6 47.3 52.0
D Yorkshire and The Humber  48.0 45.2 50.8
E East Midlands             48.6 45.9 51.3
F West Midlands             48.1 45.0 51.2
G East of England           49.0 46.0 52.0
H London                    45.0 43.0 46.9
J South East                48.0 45.1 50.8
K South West                53.4 51.5 55.2
L Wales                     49.1 45.1 53.1
M Scotland                  47.3 44.7 50.0

We used [-1] above to remove the column with region names from the results, so that we could round the estimates without getting an error.

Interpretation

  • The population in London is among the youngest in the country whereas those in the South West are among the oldest
  • This difference is likely to be statistically significant, as their 95% CIs do not overlap.
  • The same cannot be said of differences between London and the South East, as the CIs partially overlap.
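
Comparing confidence intervals by eye is only a rough guide: intervals can overlap while the difference is still significant. The survey package also provides svyttest() for a design-based test of a difference in domain means. A hedged sketch, assuming the GOR_ID codes used here (8 for London, 10 for the South East) match the coding of the dataset:

```r
# Illustrative design-based t-test of mean age, London vs the South East.
# The GOR_ID codes (8 = London, 10 = South East) are assumptions.
two.regions <- subset(bsa17.s, GOR_ID %in% c(8, 10))
svyttest(RAgeE ~ I(GOR_ID == 8), two.regions)
```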

Proportions

  • The approach for proportions is similar: specify the categories of interest, for example respondents aged 17-34, and replace svymean with svyciprop.
c<-svyby(~I(RAgecat5==1),
            by=~as_factor(GOR_ID),
            svyciprop,
            design=bsa17.s,
            vartype = "ci")

round(100*c[-1],1)
                           I(RAgecat5 == 1) ci_l ci_u
A North East                           35.2 25.8 45.9
B North West                           26.3 20.8 32.8
D Yorkshire and The Humber             31.4 25.0 38.7
E East Midlands                        27.4 20.7 35.4
F West Midlands                        30.6 24.2 37.9
G East of England                      25.3 19.5 32.0
H London                               31.5 26.0 37.7
J South East                           28.4 21.3 36.7
K South West                           18.0 12.7 25.0
L Wales                                27.0 18.2 38.2
M Scotland                             34.0 27.5 41.1

Visualising CIs

Let’s now plot this output:

ggplot(c, aes(x = `I(RAgecat5 == 1)`, y = rownames(c))) +
  geom_point(size = 3, color = "#ff9800") +
  geom_errorbar(aes(xmin = ci_l, xmax = ci_u), width = 0.1, color = "#ff9800") +
  labs(
    title = "% aged 17-34 by region",
    y = "Government Office Region",
    x = "% Aged between 17 and 34"
  ) +
  theme_minimal(base_size = 14)

Question 5

  • What is the 95% confidence interval for the proportion of people significantly interested in politics in the North East?
  • Is the proportion likely to be different in London? In what way?
  • What is the region of the UK for which the estimates are likely to be least precise?

Question 5: Answer (1)

Proportion of those interested in politics by region:

c<-svyby(~I(Politics==1 | Politics==2),
            by=~as_factor(GOR_ID),
            svyciprop,
            design=bsa17.s,
            vartype = "ci")

round(100*c[-1],1)
                           I(Politics == 1 | Politics == 2) ci_l ci_u
A North East                                           33.4 26.6 40.9
B North West                                           41.9 36.1 48.0
D Yorkshire and The Humber                             35.6 29.1 42.6
E East Midlands                                        36.9 32.9 41.1
F West Midlands                                        36.3 31.5 41.5
G East of England                                      47.2 41.4 53.1
H London                                               54.2 47.2 61.1
J South East                                           44.6 38.7 50.8
K South West                                           46.5 39.4 53.8
L Wales                                                38.6 27.7 50.7
M Scotland                                             42.7 36.0 49.8

Plotting CIs

Let’s now plot this output:

ggplot(c, aes(x = `I(Politics == 1 | Politics == 2)`, y = rownames(c))) +
  geom_point(size = 3, color = "#ff9800") +
  geom_errorbar(aes(xmin = ci_l, xmax = ci_u), width = 0.1, color = "#ff9800") +
  labs(
    title = "% interested in politics by region",
    y = "Government Office Region",
    x = "% Interested in politics 'a great deal' or 'quite a lot'"
  ) +
  theme_minimal(base_size = 14)

Question 5: Answer (2)

  • The 95% confidence interval for the proportion of people significantly interested in politics in the North East is 26.6 to 40.9.
  • By contrast, it is 47.2 to 61.1 in London: the two intervals do not overlap, so the proportion is likely to be higher in London.
  • The region with the least precise estimates (i.e. the widest confidence interval) is Wales, with more than 20 percentage points between the lower and upper bounds of the confidence interval.

Question 6

  • Using interest in politics as before, and the three-category age variable RAgecat5:
  • Produce a table showing the proportion of respondents significantly interested in politics by age group and gender
  • Assess whether the age difference in interest for politics is similar for each gender.
  • Is it fair to say that men aged under 35 are more likely to declare being interested in politics than women aged 55 and above?

Question 6 - Answer

round(
      100*
        svyby(~I(Politics==1 | Politics==2),
              by=~as_factor(RAgecat5)+as_factor(Rsex),
              svyciprop,
              design=bsa17.s,
              vartype = "ci")[c(-8,-4),c(-2,-1)],1)
             I(Politics == 1 | Politics == 2) ci_l ci_u
17-34.Male                               42.7 37.6 48.0
35-54.Male                               50.8 46.6 54.9
55+.Male                                 57.8 53.9 61.6
17-34.Female                             26.3 22.0 31.1
35-54.Female                             34.1 30.6 37.8
55+.Female                               43.0 39.6 46.5
  • Both males and females aged 55+ tend to be more interested in politics than their younger counterparts.

  • CIs for the proportion of men under 35 and women above 55 interested in politics overlap; it is unlikely that they differ significantly in the population.

Domain estimation - unresolved questions

  • Strata containing a single PSU ('lonely PSUs') prevent the usual design-based variance estimation
  • R and Stata handle such strata differently by default
  • As a result, their variance estimates may differ
  • Broader issue: design-based estimation for small subpopulations
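
When a stratum contains a single PSU, the survey package cannot compute the usual between-PSU variance contribution; how it reacts is controlled by a global option set before estimation. A sketch of the available choices (which one is appropriate depends on the survey):

```r
# Behaviour of the survey package when a stratum has a single PSU.
# The default, "fail", raises an error; common alternatives:
options(survey.lonely.psu = "adjust")      # centre the stratum at the grand mean
# options(survey.lonely.psu = "certainty") # treat lonely PSUs as sampled with certainty
# options(survey.lonely.psu = "remove")    # drop their variance contribution
```

Stata's svyset offers a similar choice through its singleunit() option, which is one reason R and Stata results can differ for small domains.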

3. Dealing with the absence of survey design variables

Estimating employment by region

  • We are using the Safeguarded Quarterly Labour Force Survey, April-July 2022.
  • As a rule, EUL versions of the LFS do not include survey design variables.
  • The LFS comes with two weighting variables:
    • pwt22 for estimation with the whole sample
    • piwt22 for earnings estimation among respondents in employment (also accounting for the high level of non-response on the earnings variables)
  • Estimation without accounting for the sample design is likely to be biased and should be reported as such, including warnings, even if the nature (over- or underestimation) and size of the bias are not known.

  • An alternative is to look for design factors tables published by the data producer which could be used to correct for the bias.

  • The Office for National Statistics regularly publishes design factor (deft) tables for the LFS, but only for its headline statistics.

Regional employment rates using R

  • Let’s first produce uncorrected estimates of the regional population.
  • We will still use the survey design functions, but declare an SRS (simple random sample) design
lfs<-read_dta(
              (paste0(
                datadir,
                "lfs/UKDA-8999-stata/lfsp_aj22_eul_pwt22.dta"
                )
               )
              )%>%
     select(PWT22,PIWT22,GOVTOF2,URESMC,ILODEFR)

table(as_factor(lfs$GOVTOF2))

          Does not apply                No answer               North East 
                       0                        0                     3460 
              North West Yorkshire and Humberside            East Midlands 
                    7283                     6435                     5701 
           West Midlands          East of England                   London 
                    6089                     7140                     6523 
              South East               South West                    Wales 
                    9699                     7237                     3750 
                Scotland         Northern Ireland 
                    5626                     7049 

The ONS uses a distinct category for Merseyside in its design factor tables, but the GOVTOF2 variable in our dataset does not single it out. We will correct this using another, more detailed region variable: URESMC.

lfs<-lfs%>%
     mutate(
            govtof=ifelse(URESMC==15,3,GOVTOF2)
            )        # Identifying Merseyside using URESMC

lfs$govtof.f<-as.ordered(lfs$govtof)                           # Converting into factor
levels(lfs$govtof.f)<-c(names(attr((lfs$GOVTOF2),"labels")[3:4]),
                        "Merseyside",
                        names(attr((lfs$GOVTOF2),"labels")[5:14])) # Adding factor levels from existing labels
table(lfs$govtof.f)

              North East               North West               Merseyside 
                    3460                     6290                      993 
Yorkshire and Humberside            East Midlands            West Midlands 
                    6435                     5701                     6089 
         East of England                   London               South East 
                    7140                     6523                     9699 
              South West                    Wales                 Scotland 
                    7237                     3750                     5626 
        Northern Ireland 
                    7049 

Let us now examine the confidence intervals for the percentage of persons in employment:

lfs.s<-svydesign(ids=~1,weights=~PWT22,data=lfs) 
d<-      svyby(~I(ILODEFR==1),
               by=~govtof.f,
               svyciprop,
               vartype="se",
               design=lfs.s)

df<-100*data.frame(d[-1])
names(df)<-c("Empl.","SE")
df["Low.1"]<-round(df$Empl.-(1.96*df$SE),1)
df["High.1"]<-round(df$Empl.+(1.96*df$SE),1)
df
                            Empl.        SE Low.1 High.1
North East               45.38046 0.9582962  43.5   47.3
North West               47.58783 0.7480920  46.1   49.1
Merseyside               46.83894 1.8518811  43.2   50.5
Yorkshire and Humberside 47.65947 0.7086235  46.3   49.0
East Midlands            48.75861 0.7679729  47.3   50.3
West Midlands            47.88792 0.7708677  46.4   49.4
East of England          50.00910 0.6768140  48.7   51.3
London                   52.42943 0.7276242  51.0   53.9
South East               49.80544 0.5905629  48.6   51.0
South West               49.65830 0.6869256  48.3   51.0
Wales                    46.71336 0.9558361  44.8   48.6
Scotland                 50.16589 0.7813547  48.6   51.7
Northern Ireland         45.40227 0.6932599  44.0   46.8

We can now import the design factors from the LFS documentation. This has to be done by hand.

df$deft<-c(0.8712,1.0857,1.3655,
           1.0051,0.9634,1.0382,
           0.8936,1.3272,0.9677,
           0.9137,1.0012,1.0437,
           0.7113)
df["Low.2"]<-round(df$Empl.-(1.96*df$SE*df$deft),1)
df["High.2"]<-round(df$Empl.+(1.96*df$SE*df$deft),1)

# Cleaning up the labels
#rownames(df)<-substr(rownames(df),9,nchar(rownames(df)))
df
                            Empl.        SE Low.1 High.1   deft Low.2 High.2
North East               45.38046 0.9582962  43.5   47.3 0.8712  43.7   47.0
North West               47.58783 0.7480920  46.1   49.1 1.0857  46.0   49.2
Merseyside               46.83894 1.8518811  43.2   50.5 1.3655  41.9   51.8
Yorkshire and Humberside 47.65947 0.7086235  46.3   49.0 1.0051  46.3   49.1
East Midlands            48.75861 0.7679729  47.3   50.3 0.9634  47.3   50.2
West Midlands            47.88792 0.7708677  46.4   49.4 1.0382  46.3   49.5
East of England          50.00910 0.6768140  48.7   51.3 0.8936  48.8   51.2
London                   52.42943 0.7276242  51.0   53.9 1.3272  50.5   54.3
South East               49.80544 0.5905629  48.6   51.0 0.9677  48.7   50.9
South West               49.65830 0.6869256  48.3   51.0 0.9137  48.4   50.9
Wales                    46.71336 0.9558361  44.8   48.6 1.0012  44.8   48.6
Scotland                 50.16589 0.7813547  48.6   51.7 1.0437  48.6   51.8
Northern Ireland         45.40227 0.6932599  44.0   46.8 0.7113  44.4   46.4

In some regions the CIs have widened (e.g. London), whereas in others they have shrunk (e.g. the North East).

Thank you for your attention

Comments, feedback and questions: pierre.walthery@manchester.ac.uk

Lumley, Thomas. 2024. “Survey: Analysis of Complex Survey Samples.”
NatCen Social Research. 2024. “British Social Attitudes Survey, 2017.” UK Data Service. https://doi.org/10.5255/UKDA-SN-8450-2.
Office for National Statistics. 2025. “Quarterly Labour Force Survey, April - June, 2022.” UK Data Service. https://doi.org/10.5255/UKDA-SN-8999-4.